The main goal of this project is to find a prediction model for Manufacturer’s Suggested Retail Price (MSRP) based on features of an automobile.
Our dataset, called Car Features and MSRP, includes specifications such as make, model, release year, and transmission of automobiles released between 1990 and 2017. This data was originally gathered from Edmunds and Twitter by Sam Keene and posted on Kaggle (kaggle.com/CooperUnion/cardataset) by CooperUnion.
This project can have multiple use cases.
There are 11,914 observations of 16 variables in the complete dataset. Among the variables are horsepower, price, engine fuel type, release year, make, model, and engine cylinders. The relevant variables are explained later in the report. The response we are focusing on is MSRP, the Manufacturer's Suggested Retail Price. Predictive modeling is a ubiquitous concept today and is not restricted to information technology or financial markets; it has found usage in almost every field. Our project is an example of how the automobile industry can utilize this idea, though pricing models are surely already in wide use there.
We decided to pursue this topic and dataset due to a shared interest in automobiles. Additionally, the potential real-world applications of this project are interesting. We are confident that we can leverage the knowledge we have gained over this semester on this project.
Installing packages (commented out) and loading libraries for future use.
#install.packages("mltools")
#install.packages("caret")
#install.packages("dplyr")
#install.packages("ggplot2")
#install.packages('DT')
library(mltools)
library(data.table)
library(knitr)
library(dplyr)
library(ggplot2)
library(caret)
library(faraway)
library(lmtest)
library(MASS)
library(DT)
The libraries loaded above are used throughout the report. They have been collected in a single chunk for ease of viewing and convenience.
Loading dataset in R from the CSV file, and removing some columns and rows which are not needed or not complete:
car_data = read.csv("Cars_data.csv")
temp = car_data
car_data = na.omit(car_data)
car_data$Model <- NULL
car_data$Vehicle.Style <- NULL
car_data$Market.Category <- NULL
# Debugging commands:
#unique(car_data$Make)
#colnames(car_data)
#unique(car_data$Engine.Fuel.Type)
#unique(car_data$Driven_Wheels)
#unique(car_data$Vehicle.Size)
#car_data_info = car_data
#head(car_data)
#colnames(car_data)
We decided to remove the model predictor because it is too specific and would over-complicate the model.
We removed market category and vehicle style for similar reasons. Vehicle style also overlaps with vehicle size, which further reduced its relevance in our view, and the values stored in the vehicle style column were not easy to process.
The dataset car_data now contains the following variables:
Make - The car brand
Year - The year the car model was released
Engine.Fuel.Type - The Fuel which the car runs on
Engine.HP - The standard measure of power of the car’s engine
Engine.Cylinders - Number of Cylinders in the engine of the car
Transmission.Type - The transmission type of the car
Driven_Wheels - Specifies to which wheels the car sends power
Number.of.Doors - Number of doors on the car
Vehicle.Size - Specifies the size of the car - compact, midsize, large
highway.MPG - Estimated miles per gallon the car travels on the highway
city.mpg - Estimated miles per gallon the car travels in the city
Popularity - Value assigned based on Twitter scraping - is part of the original dataset
MSRP - The response in our model. It stands for Manufacturer’s Suggested Retail Price.
We will now be considering all variables other than MSRP as potential predictors.
hist( car_data$MSRP, col = "lightblue", main = "Histogram of car_data MSRP Values")
It is clear that some of the MSRP values in this dataset are extremely high. We therefore consider dropping extremely high and extremely low MSRP values from the data we will use to train and test. Price is a much more important factor in the mass market than in the expensive/luxury segment (see smallbusiness.chron.com/price-sensitivity-product-65805.html), so it is reasonable to cap the price at $48,500; arguably it could be brought down further. Another factor in this decision was that the dataset has very few observations with extremely high or extremely low MSRP. Excluding these extreme values will help strengthen the model over the retained range.
Removing extreme prices less than $8,500 and greater than $48,500
car_data_priced <- car_data[!(car_data$MSRP > 48500 | car_data$MSRP < 8500), ]
range(car_data_priced$MSRP)
## [1] 8548 48500
Once this is done, the MSRP Histogram looks like the following:
hist( car_data_priced$MSRP, col = "lightpink", breaks = 40, main = "Histogram of car_data_priced MSRP Values")
This is a much more reasonable distribution for our response variable.
Removing the non-automatic/manual transmission types, and storing this new data in the car_data_transd data frame.
This is done for simplicity and practicality: there are very few automobiles with other transmission types, so our model would likely not predict them accurately even if we retained those values.
car_data_transd<-car_data_priced[!(car_data_priced$Transmission.Type=="AUTOMATED_MANUAL" | car_data_priced$Transmission.Type=="DIRECT_DRIVE" | car_data_priced$Transmission.Type=="UNKNOWN"),]
unique(car_data_transd$Transmission.Type)
## [1] "MANUAL" "AUTOMATIC"
Removing certain fuel types, keeping only gasoline and diesel, and storing the result in the car_data_fuel data frame.
This is once again done for simplicity with respect to the predictors. We have a large dataset, so we can afford to omit certain types of automobiles. Additionally, since there were only a few non-gasoline/diesel vehicles and we are targeting the mass market, this makes sense. We understand that the electric car segment is growing, and a more up-to-date dataset (this one spans 1990 to 2017) would help us cater to that market.
car_data_fuel<-car_data_transd[!(grepl("flex", car_data_transd$Engine.Fuel.Type, fixed = TRUE)
|car_data_transd$Engine.Fuel.Type=="electric" | car_data_transd$Engine.Fuel.Type=="" | car_data_transd$Engine.Fuel.Type=="natural gas"),]
unique(car_data_fuel$Engine.Fuel.Type)
## [1] "premium unleaded (required)" "premium unleaded (recommended)"
## [3] "regular unleaded" "diesel"
Assigning the different types of gasoline to a single "gasoline" value. After this step, the only two fuel-type values are "gasoline" and "diesel", as shown below.
We combine the different types of gasoline into one category, since they are all still gasoline.
car_data_fuel$Engine.Fuel.Type[car_data_fuel$Engine.Fuel.Type == "premium unleaded (required)" ] <- "gasoline"
car_data_fuel$Engine.Fuel.Type[car_data_fuel$Engine.Fuel.Type == "regular unleaded" ] <- "gasoline"
car_data_fuel$Engine.Fuel.Type[car_data_fuel$Engine.Fuel.Type == "premium unleaded (recommended)" ] <- "gasoline"
unique(car_data_fuel$Engine.Fuel.Type)
## [1] "gasoline" "diesel"
Removing extremely rare brands in our dataset
We remove those car makes (brands) that occur 20 or fewer times among the 7,000+ observations in our current dataset.
big_brands = names(which(table(car_data_fuel$Make)>20))
small_removed = filter(car_data_fuel, Make %in% big_brands)
Assigning factors for categorical variables and creating ReleasedYearsAgo.
The added ReleasedYearsAgo variable is how many years ago the model was released, computed as 2020 (the year of this analysis) minus the dataset's Year value.
car_data_factored = small_removed
#car_data_factored <- car_data_fuel[!(as.numeric(car_data_fuel$Make) %in% which(table(car_data_fuel$Make)<100)),]
car_data_factored$Vehicle.Size <- factor(car_data_factored$Vehicle.Size)
car_data_factored$Transmission.Type <- factor(car_data_factored$Transmission.Type)
car_data_factored$Engine.Fuel.Type <- factor(car_data_factored$Engine.Fuel.Type)
car_data_factored$Driven_Wheels <- factor(car_data_factored$Driven_Wheels)
car_data_factored$Engine.Cylinders <- factor(car_data_factored$Engine.Cylinders)
car_data_factored$Number.of.Doors <- factor(car_data_factored$Number.of.Doors)
car_data_factored$Make <- factor(car_data_factored$Make, exclude = FALSE)
levels(car_data_factored$Vehicle.Size)
## [1] "Compact" "Large" "Midsize"
levels(car_data_factored$Transmission.Type)
## [1] "AUTOMATIC" "MANUAL"
levels(car_data_factored$Engine.Fuel.Type)
## [1] "diesel" "gasoline"
levels(car_data_factored$Driven_Wheels)
## [1] "all wheel drive" "four wheel drive" "front wheel drive"
## [4] "rear wheel drive"
levels(car_data_factored$Engine.Cylinders)
## [1] "3" "4" "5" "6" "8"
levels(car_data_factored$Number.of.Doors)
## [1] "2" "3" "4"
levels(car_data_factored$Make)
## [1] "Acura" "Audi" "BMW" "Buick"
## [5] "Cadillac" "Chevrolet" "Chrysler" "Dodge"
## [9] "FIAT" "Ford" "GMC" "Honda"
## [13] "Hyundai" "Infiniti" "Kia" "Land Rover"
## [17] "Lexus" "Lincoln" "Mazda" "Mercedes-Benz"
## [21] "Mitsubishi" "Nissan" "Oldsmobile" "Pontiac"
## [25] "Saab" "Scion" "Subaru" "Suzuki"
## [29] "Toyota" "Volkswagen" "Volvo"
car_data_factored$ReleasedYearsAgo <- with(car_data_factored, 2020 - Year)
Removing repetitive/unnecessary column(s)
car_data_factored$Year <- NULL
We create some plots which compare the potential predictors to the response variable (MSRP)
plot(MSRP ~ Engine.HP, data = car_data_factored,
main = "MSRP versus Engine Horse Power",
xlab = "Engine Horse Power",
col = "skyblue", pch = 20, las = 1, mgp = c(3,0.5,0))
options(scipen=5)
plot(MSRP ~ Engine.Fuel.Type, data = car_data_factored,
main = "MSRP versus Fuel Type",
xlab = "Fuel Type",
col = "slateblue2", pch = 20,las = 1, mgp = c(3,0.5,0))
plot(MSRP ~ Engine.Cylinders, data = car_data_factored,
main = "MSRP versus Engine Cylinders",
xlab = "Engine Cylinder Count",
col = "maroon", pch = 20, las = 1, mgp = c(3,0.5,0))
plot(MSRP ~ Transmission.Type, data = car_data_factored,
main = "MSRP versus Transmission Type",
xlab = "Transmission Type",
col = "coral2", pch = 20, las = 1, mgp = c(3,0.5,0))
plot(MSRP ~ Vehicle.Size, data = car_data_factored,
main = "MSRP versus Vehicle Size",
xlab = "Vehicle Size",
col = "steelblue", pch = 20, las = 1, mgp = c(3,0.5,0))
The graphs above support our belief that the predictors in the dataset are relevant for predicting MSRP. Elaborating on some of the charts:
1) The Horsepower vs MSRP graph shows that the MSRP increases as the horsepower of the engine increases.
2) The Transmission Type vs MSRP graph shows that automatic vehicles are generally priced higher than manual ones.
3) The Cylinder vs MSRP plot shows that as the cylinder count in the engine increases, the MSRP increases.
These findings bolster our motivation for building this model.
Splitting data into train and test:
set.seed(100)
#train-test split using 70%
samplesize = round(0.70*nrow(car_data_factored), 0)
index = sample(seq_len(nrow(car_data_factored)), size = samplesize)
data_train = car_data_factored[index,]
data_test = car_data_factored[-index,]
Creating a basic additive model:
msrp_mod_additive = lm(MSRP ~ ., data = data_train)
summary(msrp_mod_additive)$adj.r.sq
## [1] 0.8247323
We get a fairly high value for adjusted R-squared, which is encouraging. We now consider some other models as well.
Creating Quadratic Model with AIC Step in both directions
MSRP_big_mod_poly = lm(
MSRP ~ . + I(Engine.HP ^ 2) + I(ReleasedYearsAgo ^ 2) + I(city.mpg ^ 2) + I(highway.MPG ^ 2) + I(Popularity ^ 2),
data = data_train)
MSRP_mod_both_aic_poly = step(MSRP_big_mod_poly, direction = "both", trace = 0)
Creating Linear Model with AIC Step in both directions
MSRP_big_mod_linear = lm(
MSRP ~ . ,
data = data_train)
MSRP_mod_both_aic_lin = step(MSRP_big_mod_linear, direction = "both", trace = 0)
summary(MSRP_mod_both_aic_lin)$r.sq
## [1] 0.8264428
Creating Linear Model with BIC Step in both directions
msrp_mod_start = lm(MSRP ~ ., data = data_train)
n = length(resid(msrp_mod_start))
msrp_mod_linear_both_bic = step(msrp_mod_start, direction = "both", k = log(n), trace = 0)
Creating Quadratic Model with BIC Step in both directions
msrp_mod_start2 = lm(MSRP ~ . + I(Engine.HP^2) + I(city.mpg^2) + I(highway.MPG^2) + I(Popularity^2) + I(ReleasedYearsAgo^2), data = data_train)
n = length(resid(msrp_mod_start2))
msrp_mod_poly_both_bic = step( msrp_mod_start2, direction = "both", k = log(n), trace = 0)
Defining a function, calc_loocv_rmse, to compute the leave-one-out cross-validated RMSE:
calc_loocv_rmse = function(model) {
  # LOOCV RMSE via the leverage (hat value) shortcut -- no refitting needed
  sqrt(mean((resid(model) / (1 - hatvalues(model))) ^ 2))
}
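The division by (1 − h_ii) is the standard leverage shortcut: for a linear model it reproduces the exact leave-one-out residuals without refitting the model n times. As a quick sanity check, the sketch below verifies the identity on the built-in mtcars data (used purely for illustration, not part of this project's dataset):

```r
# Sketch: verify the leverage-based LOOCV shortcut against brute-force
# leave-one-out refitting, using the built-in mtcars data.
fit = lm(mpg ~ wt + hp, data = mtcars)

# Shortcut: divide each residual by (1 - leverage); no refitting required
shortcut_rmse = sqrt(mean((resid(fit) / (1 - hatvalues(fit))) ^ 2))

# Brute force: refit n times, leaving one observation out each time
n = nrow(mtcars)
errors = numeric(n)
for (i in seq_len(n)) {
  fit_i = lm(mpg ~ wt + hp, data = mtcars[-i, ])
  errors[i] = mtcars$mpg[i] - predict(fit_i, newdata = mtcars[i, ])
}
brute_rmse = sqrt(mean(errors ^ 2))

all.equal(shortcut_rmse, brute_rmse)  # TRUE: the identity is exact for lm
```

Because the shortcut is exact for least-squares fits, calc_loocv_rmse gives us an honest cross-validation metric at essentially no extra cost.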
We create a table to compare the metrics calc_loocv_rmse, adjusted R-squared, and R-squared for the 5 models we have created. Based on this, we will decide which model to pursue further.
# This table compares calc_loocv_rmse, adjusted R-squared, and R-squared
to_insert_1 = c("msrp_mod_additive", calc_loocv_rmse(msrp_mod_additive),
                summary(msrp_mod_additive)$adj.r.sq, summary(msrp_mod_additive)$r.sq)
to_insert_2 = c("MSRP_mod_both_aic_lin", calc_loocv_rmse(MSRP_mod_both_aic_lin),
                summary(MSRP_mod_both_aic_lin)$adj.r.sq, summary(MSRP_mod_both_aic_lin)$r.sq)
to_insert_3 = c("MSRP_mod_both_aic_poly", calc_loocv_rmse(MSRP_mod_both_aic_poly),
                summary(MSRP_mod_both_aic_poly)$adj.r.sq, summary(MSRP_mod_both_aic_poly)$r.sq)
to_insert_4 = c("msrp_mod_linear_both_bic", calc_loocv_rmse(msrp_mod_linear_both_bic),
                summary(msrp_mod_linear_both_bic)$adj.r.sq, summary(msrp_mod_linear_both_bic)$r.sq)
to_insert_5 = c("msrp_mod_poly_both_bic", calc_loocv_rmse(msrp_mod_poly_both_bic),
                summary(msrp_mod_poly_both_bic)$adj.r.sq, summary(msrp_mod_poly_both_bic)$r.sq)
dataframe.values = c(to_insert_1, to_insert_2, to_insert_3, to_insert_4, to_insert_5)
dataframe = matrix(dataframe.values, nrow = 5, byrow = TRUE)
colnames(dataframe) = c("Model Name", "calc_loocv_rmse", "Adj. R-Sq.", "R-Sq.")
datatable(dataframe)
Listing the composition of all the models
msrp_mod_additive$call
## lm(formula = MSRP ~ ., data = data_train)
MSRP_mod_both_aic_lin$call
## lm(formula = MSRP ~ Make + Engine.Fuel.Type + Engine.HP + Engine.Cylinders +
## Transmission.Type + Driven_Wheels + Number.of.Doors + Vehicle.Size +
## highway.MPG + city.mpg + ReleasedYearsAgo, data = data_train)
MSRP_mod_both_aic_poly$call
## lm(formula = MSRP ~ Make + Engine.Fuel.Type + Engine.HP + Engine.Cylinders +
## Transmission.Type + Driven_Wheels + Number.of.Doors + Vehicle.Size +
## highway.MPG + ReleasedYearsAgo + I(Engine.HP^2) + I(ReleasedYearsAgo^2) +
## I(city.mpg^2) + I(highway.MPG^2), data = data_train)
msrp_mod_linear_both_bic$call
## lm(formula = MSRP ~ Make + Engine.Fuel.Type + Engine.HP + Engine.Cylinders +
## Transmission.Type + Driven_Wheels + Vehicle.Size + highway.MPG +
## city.mpg + ReleasedYearsAgo, data = data_train)
msrp_mod_poly_both_bic$call
## lm(formula = MSRP ~ Make + Engine.Fuel.Type + Engine.HP + Engine.Cylinders +
## Transmission.Type + Driven_Wheels + Vehicle.Size + highway.MPG +
## ReleasedYearsAgo + I(Engine.HP^2) + I(city.mpg^2) + I(highway.MPG^2),
## data = data_train)
We see that these models are all quite similar; the two that differ somewhat are MSRP_mod_both_aic_poly and msrp_mod_poly_both_bic, and these two consistently give marginally better values for the metrics.
We choose to pursue MSRP_mod_both_aic_poly for now, since it uses an extra variable (Number.of.Doors) and stands very slightly better on adjusted R-squared.
#Renaming the model for convenience:
chosen_model = MSRP_mod_both_aic_poly
We can consider removing influential points from our model
cd_chosen_mod = cooks.distance(chosen_model)
length(cd_chosen_mod[cd_chosen_mod > 4 / length(cd_chosen_mod)])
## [1] 287
# Observations that are NOT influential by the common 4/n heuristic
cox_sub = cd_chosen_mod <= 4 / length(cd_chosen_mod)
Of the 4,817 points used to fit our chosen model, 287 appear to be influential by the 4/n Cook's distance heuristic. We consider removing these from the training data.
We reload the model, with the influential points removed:
chosen_model_uninf = lm(formula = MSRP ~ Make + Engine.Fuel.Type + Engine.HP + Engine.Cylinders +
Transmission.Type + Driven_Wheels + Number.of.Doors + Vehicle.Size +
highway.MPG + ReleasedYearsAgo + I(Engine.HP^2) + I(ReleasedYearsAgo^2) +
I(city.mpg^2) + I(highway.MPG^2), data = data_train, subset = cox_sub)
summary(chosen_model_uninf)$adj.r.sq
## [1] 0.8677063
We can see that this removal of influential points has increased the adjusted R-squared from 0.83 to 0.87, which is expected and encouraging.
Creating a few functions to evaluate the model assumptions:
The assumption_tester function below (together with the plot_func helper) will reduce code repetition and assist in building a thorough model.
plot_func = function(model, pointcol = "slateblue3",linecol = "limegreen") {
plot(fitted(model), resid(model), col = pointcol, pch = 20, xlab = "Fitted", ylab = "Residuals", main = "Fitted vs Residuals")
abline(h = 0, col = linecol, lwd = 4)
}
assumption_tester = function(model) {
qqnorm(resid(model), main = "Normal Q-Q Plot", col = "darkgrey")
qqline(resid(model), col = "dodgerblue", lwd = 2)
# Normality test (shapiro.test accepts at most 5000 observations)
print("Shapiro test:")
print(shapiro.test(resid(model)[0:5000]))
hist(model$resid, col = "skyblue3", main = "Histogram of Residuals")
# Multicollinearity: report predictors with VIF above 5
vif_vals = vif(model)
print(vif_vals[vif_vals > 5])
#Constant Variance
plot_func(model)
print("Breusch-Pagan test:")
print(bptest(model))
print("R-Squared:")
print(summary(model)$r.sq)
print("Significance Test P-Value:")
print(pf(summary(model)$fstatistic[1],summary(model)$fstatistic[2],summary(model)$fstatistic[3],lower.tail=FALSE))
}
assumption_tester(chosen_model_uninf)
## [1] "Shapiro test:"
##
## Shapiro-Wilk normality test
##
## data: resid(model)[0:5000]
## W = 0.99567, p-value = 2.629e-10
## Engine.HP Engine.Cylinders4 Engine.Cylinders5
## 63.322818 120.649645 11.993765
## Engine.Cylinders6 Engine.Cylinders8 highway.MPG
## 121.534664 38.292598 146.696486
## ReleasedYearsAgo I(Engine.HP^2) I(ReleasedYearsAgo^2)
## 31.943300 56.968450 29.586542
## I(city.mpg^2) I(highway.MPG^2)
## 6.163913 148.458384
## [1] "Breusch-Pagan test:"
##
## studentized Breusch-Pagan test
##
## data: model
## BP = 637.84, df = 50, p-value < 2.2e-16
##
## [1] "R-Squared:"
## [1] 0.8691668
## [1] "Significance Test P-Value:"
## value
## 0
The Shapiro-Wilk test and the Q-Q plot indicate non-normality of errors.
The constant variance assumption (based on the fitted vs. residuals graph and the BP test) also appears to be violated.
The VIF values for quite a few predictors are high against a threshold of 5, so there does appear to be substantial multicollinearity.
The p-value of the overall significance test is essentially 0, so the regression is highly significant.
We can consider the Box-Cox Transformation method since our response variable (MSRP) is strictly positive
boxcox(chosen_model_uninf, plotit = TRUE, lambda = seq(0.2, 0.5, by = 0.1))
From this, we see that λ = 0.4 is extremely close to the maximum and within the confidence interval.
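For reference, the Box-Cox family applied here is (y^λ − 1)/λ for λ ≠ 0, with log(y) as the limiting case at λ = 0 — which is what motivates also trying a log transform when λ is small. A minimal sketch (the value 30000 is just an arbitrary number on the MSRP scale, for illustration) shows the transform approaching the log as λ shrinks:

```r
# Sketch: the Box-Cox family is (y^lambda - 1)/lambda for lambda != 0,
# with log(y) as the limiting case at lambda = 0.
boxcox_transform = function(y, lambda) {
  if (lambda == 0) log(y) else (y ^ lambda - 1) / lambda
}

y = 30000  # an arbitrary value on the MSRP scale, for illustration
# As lambda shrinks toward 0, the transformed value approaches log(y)
sapply(c(0.4, 0.1, 0.01, 0.001), function(l) boxcox_transform(y, l))
log(y)
```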
We can now fit a model with the transformation of λ = 0.4 applied to the response variable:
chosen_model_trans1 = lm(formula = (((MSRP ^ 0.4) - 1) / 0.4) ~ Make + Engine.Fuel.Type + Engine.HP + Engine.Cylinders +
Transmission.Type + Driven_Wheels + Number.of.Doors + Vehicle.Size +
highway.MPG + ReleasedYearsAgo + I(Engine.HP^2) + I(ReleasedYearsAgo^2) +
I(city.mpg^2) + I(highway.MPG^2), data = data_train, subset = cox_sub)
assumption_tester(chosen_model_trans1)
## [1] "Shapiro test:"
##
## Shapiro-Wilk normality test
##
## data: resid(model)[0:5000]
## W = 0.99833, p-value = 0.0001015
## Engine.HP Engine.Cylinders4 Engine.Cylinders5
## 63.322818 120.649645 11.993765
## Engine.Cylinders6 Engine.Cylinders8 highway.MPG
## 121.534664 38.292598 146.696486
## ReleasedYearsAgo I(Engine.HP^2) I(ReleasedYearsAgo^2)
## 31.943300 56.968450 29.586542
## I(city.mpg^2) I(highway.MPG^2)
## 6.163913 148.458384
## [1] "Breusch-Pagan test:"
##
## studentized Breusch-Pagan test
##
## data: model
## BP = 416.05, df = 50, p-value < 2.2e-16
##
## [1] "R-Squared:"
## [1] 0.8743651
## [1] "Significance Test P-Value:"
## value
## 0
Since our λ is somewhat close to 0, we also consider a log transformation of the dependent variable:
chosen_model_trans2 = lm(formula = log(MSRP) ~ Make + Engine.Fuel.Type + Engine.HP + Engine.Cylinders +
Transmission.Type + Driven_Wheels + Number.of.Doors + Vehicle.Size +
highway.MPG + ReleasedYearsAgo + I(Engine.HP^2) + I(ReleasedYearsAgo^2) +
I(city.mpg^2) + I(highway.MPG^2), data = data_train, subset = cox_sub)
assumption_tester(chosen_model_trans2)
## [1] "Shapiro test:"
##
## Shapiro-Wilk normality test
##
## data: resid(model)[0:5000]
## W = 0.99945, p-value = 0.2143
## Engine.HP Engine.Cylinders4 Engine.Cylinders5
## 63.322818 120.649645 11.993765
## Engine.Cylinders6 Engine.Cylinders8 highway.MPG
## 121.534664 38.292598 146.696486
## ReleasedYearsAgo I(Engine.HP^2) I(ReleasedYearsAgo^2)
## 31.943300 56.968450 29.586542
## I(city.mpg^2) I(highway.MPG^2)
## 6.163913 148.458384
## [1] "Breusch-Pagan test:"
##
## studentized Breusch-Pagan test
##
## data: model
## BP = 390.55, df = 50, p-value < 2.2e-16
##
## [1] "R-Squared:"
## [1] 0.8722616
## [1] "Significance Test P-Value:"
## value
## 0
Using the log transformation, we have satisfied the normality assumption.
We have improved the p-value of the Shapiro-Wilk normality test from 2.629e-10 (for the untransformed chosen_model_uninf) to 0.214. This is a substantial improvement.
Additionally, the histogram of the residuals and the Q-Q Plot are much better than before, both indicating the same progress.
While the R-squared for the log-transformed model is slightly lower than for the λ = 0.4 model, we still prefer it since it satisfies the normality assumption.
summary(chosen_model_trans2)
##
## Call:
## lm(formula = log(MSRP) ~ Make + Engine.Fuel.Type + Engine.HP +
## Engine.Cylinders + Transmission.Type + Driven_Wheels + Number.of.Doors +
## Vehicle.Size + highway.MPG + ReleasedYearsAgo + I(Engine.HP^2) +
## I(ReleasedYearsAgo^2) + I(city.mpg^2) + I(highway.MPG^2),
## data = data_train, subset = cox_sub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.45790 -0.07453 -0.00040 0.07594 0.35047
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.7676387809 0.0704098475 138.725 < 2e-16 ***
## MakeAudi 0.1107583592 0.0232007674 4.774 1.86e-06 ***
## MakeBMW 0.1096599657 0.0175191745 6.259 4.23e-10 ***
## MakeBuick -0.0418663294 0.0169100484 -2.476 0.013330 *
## MakeCadillac 0.0526916848 0.0170312350 3.094 0.001988 **
## MakeChevrolet -0.1985736265 0.0128339340 -15.473 < 2e-16 ***
## MakeChrysler -0.1399640279 0.0176383653 -7.935 2.63e-15 ***
## MakeDodge -0.2256452739 0.0139546715 -16.170 < 2e-16 ***
## MakeFIAT -0.1123017990 0.0243513544 -4.612 4.10e-06 ***
## MakeFord -0.1690131105 0.0135308433 -12.491 < 2e-16 ***
## MakeGMC -0.1279845968 0.0148701591 -8.607 < 2e-16 ***
## MakeHonda -0.1249695163 0.0133014658 -9.395 < 2e-16 ***
## MakeHyundai -0.2140082217 0.0144463396 -14.814 < 2e-16 ***
## MakeInfiniti -0.0537307808 0.0151452958 -3.548 0.000393 ***
## MakeKia -0.2788716101 0.0148285339 -18.806 < 2e-16 ***
## MakeLand Rover 0.1142652934 0.0243723592 4.688 2.84e-06 ***
## MakeLexus 0.0247312276 0.0179468161 1.378 0.168263
## MakeLincoln 0.0160089262 0.0179552902 0.892 0.372656
## MakeMazda -0.1484751845 0.0139165925 -10.669 < 2e-16 ***
## MakeMercedes-Benz 0.0705314013 0.0258655435 2.727 0.006419 **
## MakeMitsubishi -0.1844906873 0.0160122514 -11.522 < 2e-16 ***
## MakeNissan -0.2168934368 0.0133320396 -16.269 < 2e-16 ***
## MakeOldsmobile -0.1049059605 0.0241980639 -4.335 1.49e-05 ***
## MakePontiac -0.1710689109 0.0163736255 -10.448 < 2e-16 ***
## MakeSaab 0.1104256177 0.0222776313 4.957 7.43e-07 ***
## MakeScion -0.2624016973 0.0215446689 -12.179 < 2e-16 ***
## MakeSubaru -0.1658861681 0.0150831921 -10.998 < 2e-16 ***
## MakeSuzuki -0.2823760520 0.0142962454 -19.752 < 2e-16 ***
## MakeToyota -0.1722086381 0.0131441043 -13.102 < 2e-16 ***
## MakeVolkswagen -0.0471424062 0.0132173963 -3.567 0.000365 ***
## MakeVolvo 0.0470279672 0.0160942208 2.922 0.003495 **
## Engine.Fuel.Typegasoline -0.3117598095 0.0237817269 -13.109 < 2e-16 ***
## Engine.HP 0.0061797562 0.0001917736 32.224 < 2e-16 ***
## Engine.Cylinders4 -0.0436305007 0.0363241341 -1.201 0.229759
## Engine.Cylinders5 -0.1379095357 0.0387686101 -3.557 0.000379 ***
## Engine.Cylinders6 -0.0486212999 0.0374257645 -1.299 0.193963
## Engine.Cylinders8 -0.0585019901 0.0389228317 -1.503 0.132903
## Transmission.TypeMANUAL -0.0955015224 0.0047148535 -20.255 < 2e-16 ***
## Driven_Wheelsfour wheel drive -0.0003423314 0.0080857960 -0.042 0.966232
## Driven_Wheelsfront wheel drive -0.0558768577 0.0058075086 -9.621 < 2e-16 ***
## Driven_Wheelsrear wheel drive -0.0478364683 0.0063905758 -7.485 8.53e-14 ***
## Number.of.Doors3 -0.0513270853 0.0156852390 -3.272 0.001075 **
## Number.of.Doors4 -0.0167786129 0.0052880774 -3.173 0.001519 **
## Vehicle.SizeLarge 0.0619074197 0.0073336436 8.442 < 2e-16 ***
## Vehicle.SizeMidsize 0.0542005172 0.0046440184 11.671 < 2e-16 ***
## highway.MPG 0.0144346837 0.0032194951 4.484 7.53e-06 ***
## ReleasedYearsAgo -0.0172179285 0.0019569398 -8.798 < 2e-16 ***
## I(Engine.HP^2) -0.0000072626 0.0000003798 -19.125 < 2e-16 ***
## I(ReleasedYearsAgo^2) 0.0003177338 0.0000929580 3.418 0.000636 ***
## I(city.mpg^2) 0.0003050352 0.0000137487 22.186 < 2e-16 ***
## I(highway.MPG^2) -0.0005023177 0.0000551703 -9.105 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1112 on 4479 degrees of freedom
## Multiple R-squared: 0.8723, Adjusted R-squared: 0.8708
## F-statistic: 611.7 on 50 and 4479 DF, p-value: < 2.2e-16
print("VIF:")
## [1] "VIF:"
vif(chosen_model_trans2)[vif(chosen_model_trans2) > 5]
## Engine.HP Engine.Cylinders4 Engine.Cylinders5
## 63.322818 120.649645 11.993765
## Engine.Cylinders6 Engine.Cylinders8 highway.MPG
## 121.534664 38.292598 146.696486
## ReleasedYearsAgo I(Engine.HP^2) I(ReleasedYearsAgo^2)
## 31.943300 56.968450 29.586542
## I(city.mpg^2) I(highway.MPG^2)
## 6.163913 148.458384
As we can see from the VIF results, a few predictors show signs of multicollinearity. We start by removing the predictor Engine.Cylinders, since all the dummy variables associated with it have high VIFs and fairly high p-values in the summary.
chosen_model_drop1 = lm(formula = log(MSRP) ~ Make + Engine.Fuel.Type + Engine.HP +
Transmission.Type + Driven_Wheels + Number.of.Doors +
Vehicle.Size + highway.MPG + ReleasedYearsAgo + I(Engine.HP^2) +
I(ReleasedYearsAgo^2) + I(city.mpg^2) + I(highway.MPG^2),
data = data_train, subset = cox_sub)
assumption_tester(chosen_model_drop1)
## [1] "Shapiro test:"
##
## Shapiro-Wilk normality test
##
## data: resid(model)[0:5000]
## W = 0.99938, p-value = 0.1289
## Engine.HP highway.MPG ReleasedYearsAgo
## 53.442288 120.303145 31.123693
## I(Engine.HP^2) I(ReleasedYearsAgo^2) I(city.mpg^2)
## 45.449486 28.923387 6.023574
## I(highway.MPG^2)
## 126.028690
## [1] "Breusch-Pagan test:"
##
## studentized Breusch-Pagan test
##
## data: model
## BP = 404.2, df = 46, p-value < 2.2e-16
##
## [1] "R-Squared:"
## [1] 0.8706557
## [1] "Significance Test P-Value:"
## value
## 0
summary(chosen_model_drop1)
##
## Call:
## lm(formula = log(MSRP) ~ Make + Engine.Fuel.Type + Engine.HP +
## Transmission.Type + Driven_Wheels + Number.of.Doors + Vehicle.Size +
## highway.MPG + ReleasedYearsAgo + I(Engine.HP^2) + I(ReleasedYearsAgo^2) +
## I(city.mpg^2) + I(highway.MPG^2), data = data_train, subset = cox_sub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.45512 -0.07497 -0.00005 0.07508 0.35631
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.7250813801 0.0605571296 160.593 < 2e-16 ***
## MakeAudi 0.1110770172 0.0232462929 4.778 1.82e-06 ***
## MakeBMW 0.1079299399 0.0175856172 6.137 9.11e-10 ***
## MakeBuick -0.0403538836 0.0169900449 -2.375 0.017584 *
## MakeCadillac 0.0546183866 0.0170790983 3.198 0.001394 **
## MakeChevrolet -0.1981242912 0.0128838896 -15.378 < 2e-16 ***
## MakeChrysler -0.1370541186 0.0177058488 -7.741 1.21e-14 ***
## MakeDodge -0.2235764739 0.0140194782 -15.948 < 2e-16 ***
## MakeFIAT -0.1122386489 0.0244784068 -4.585 4.66e-06 ***
## MakeFord -0.1664377445 0.0135726624 -12.263 < 2e-16 ***
## MakeGMC -0.1290403786 0.0148909178 -8.666 < 2e-16 ***
## MakeHonda -0.1256082944 0.0133676601 -9.396 < 2e-16 ***
## MakeHyundai -0.2133292758 0.0145039465 -14.708 < 2e-16 ***
## MakeInfiniti -0.0510253133 0.0151238231 -3.374 0.000748 ***
## MakeKia -0.2771430254 0.0148813099 -18.624 < 2e-16 ***
## MakeLand Rover 0.1136093096 0.0242091667 4.693 2.77e-06 ***
## MakeLexus 0.0251829880 0.0180500499 1.395 0.163032
## MakeLincoln 0.0151029681 0.0179361481 0.842 0.399810
## MakeMazda -0.1459914711 0.0139543739 -10.462 < 2e-16 ***
## MakeMercedes-Benz 0.0707666570 0.0258645414 2.736 0.006243 **
## MakeMitsubishi -0.1775568995 0.0158440508 -11.207 < 2e-16 ***
## MakeNissan -0.2167346373 0.0134050324 -16.168 < 2e-16 ***
## MakeOldsmobile -0.1040426487 0.0243372735 -4.275 1.95e-05 ***
## MakePontiac -0.1684476029 0.0164490172 -10.241 < 2e-16 ***
## MakeSaab 0.1147520504 0.0222312297 5.162 2.55e-07 ***
## MakeScion -0.2583758349 0.0216461877 -11.936 < 2e-16 ***
## MakeSubaru -0.1651884169 0.0150943396 -10.944 < 2e-16 ***
## MakeSuzuki -0.2764005317 0.0142936404 -19.337 < 2e-16 ***
## MakeToyota -0.1717406939 0.0132121360 -12.999 < 2e-16 ***
## MakeVolkswagen -0.0621607819 0.0130519756 -4.763 1.97e-06 ***
## MakeVolvo 0.0221892958 0.0156048962 1.422 0.155112
## Engine.Fuel.Typegasoline -0.3170970491 0.0238417591 -13.300 < 2e-16 ***
## Engine.HP 0.0062622071 0.0001772027 35.339 < 2e-16 ***
## Transmission.TypeMANUAL -0.0910929333 0.0046733962 -19.492 < 2e-16 ***
## Driven_Wheelsfour wheel drive -0.0025212844 0.0080995472 -0.311 0.755597
## Driven_Wheelsfront wheel drive -0.0563569300 0.0057772135 -9.755 < 2e-16 ***
## Driven_Wheelsrear wheel drive -0.0486545809 0.0063857487 -7.619 3.09e-14 ***
## Number.of.Doors3 -0.0488216227 0.0157235908 -3.105 0.001915 **
## Number.of.Doors4 -0.0155146314 0.0053139702 -2.920 0.003522 **
## Vehicle.SizeLarge 0.0612643297 0.0070366642 8.706 < 2e-16 ***
## Vehicle.SizeMidsize 0.0566746428 0.0046221716 12.261 < 2e-16 ***
## highway.MPG 0.0140228816 0.0029324830 4.782 1.79e-06 ***
## ReleasedYearsAgo -0.0194761609 0.0019429081 -10.024 < 2e-16 ***
## I(Engine.HP^2) -0.0000074633 0.0000003412 -21.876 < 2e-16 ***
## I(ReleasedYearsAgo^2) 0.0004130276 0.0000924450 4.468 8.10e-06 ***
## I(city.mpg^2) 0.0003028870 0.0000136704 22.156 < 2e-16 ***
## I(highway.MPG^2) -0.0004876661 0.0000511277 -9.538 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1119 on 4483 degrees of freedom
## Multiple R-squared: 0.8707, Adjusted R-squared: 0.8693
## F-statistic: 656 on 46 and 4483 DF, p-value: < 2.2e-16
Our fitted vs. residuals plot still does not look perfect, but the changes we have made have improved it substantially. It has a diagonal edge at the top right, which appears to be an artifact of the price limits we imposed (a guess based on some reading online).
Our Normal Q-Q Plot looks good, and so does the histogram of residuals. These are, once again, big improvements on the original assumption plots we had created.
We choose to keep the remaining predictors with high VIF values. The high VIFs are most likely caused by including polynomial terms of the same variables, and all of these predictors are significant in the model summary, so it makes sense to retain them.
Once again, though our adjusted R Squared value has slightly decreased, we still prefer this model with the dropped predictor.
We choose chosen_model_drop1 as our final model for the purposes of this report. Its predictors are Make, Engine.Fuel.Type, Engine.HP, Transmission.Type, Driven_Wheels, Number.of.Doors, Vehicle.Size, highway.MPG, ReleasedYearsAgo, and the squared terms I(Engine.HP^2), I(ReleasedYearsAgo^2), I(city.mpg^2), and I(highway.MPG^2).
Here, we make predictions using the model we have chosen. We will use our test set, and then plot the predicted MSRP values against the actual MSRP values.
Retrieving the predictions made by our model on the test data:
# Predictions are on the log scale, so exponentiate to get MSRP in dollars
test_predictions_log = predict(chosen_model_drop1, newdata = data_test, type = "response")
test_predictions = exp(test_predictions_log)
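Before plotting, we can quantify how close these predictions are with a couple of summary metrics. This is a minimal sketch, assuming `test_predictions` and `data_test` from the chunk above are in scope; the metric names are our own, not from a package:

```r
# Test-set accuracy of the exponentiated predictions, in dollar terms
test_rmse = sqrt(mean((data_test$MSRP - test_predictions) ^ 2))  # root mean squared error
test_mape = mean(abs(data_test$MSRP - test_predictions) / data_test$MSRP)  # mean absolute percentage error
test_rmse
test_mape
```

RMSE is reported in dollars, while MAPE gives a scale-free percentage error, which is often easier to interpret for prices spanning a wide range.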
Plotting:
data = data.frame(
  x = test_predictions,
  y = data_test$MSRP
)
plot(data$x, data$y,
     pch = 1,
     cex = 1,
     col = "paleturquoise3",
     xlab = "Predicted Value of MSRP", ylab = "Actual Value of MSRP",
     main = "Predicted vs Actual MSRP"
)
abline(0, 1, col = "navyblue", lwd = 2)
Since we are plotting the predicted vs. the actual MSRPs of the test data, we expect the points to lie along the positively sloped 45-degree line, depicted in dark blue. Our predictions match the actual MSRP values fairly closely, and the general trend is what we expected.
The model does seem to slightly overestimate the MSRP at low prices and underestimate it at high prices, which is interesting to note. There is also a minor increase in the magnitude of the absolute errors as MSRP increases; this is expected, since we modeled log(MSRP), so errors are roughly multiplicative on the dollar scale.
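The over- and underestimation pattern described above can be checked numerically rather than just visually. A minimal sketch, assuming `test_predictions` and `data_test` from the earlier chunks are in scope:

```r
# Mean prediction error within each quartile of actual MSRP
# (positive = overestimate, negative = underestimate)
err  = test_predictions - data_test$MSRP
band = cut(data_test$MSRP,
           breaks = quantile(data_test$MSRP, probs = 0:4 / 4),
           include.lowest = TRUE)
tapply(err, band, mean)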
Let us assume that BMW is planning to launch a new car with certain specifications, as detailed in the variable newdata_bmw below. We can predict the price of this BMW with our model! Creating a prediction interval:
newdata_bmw = data.frame(Make="BMW", Engine.Fuel.Type="gasoline", Engine.HP = 220, Transmission.Type = "AUTOMATIC", Driven_Wheels = "rear wheel drive" , Number.of.Doors = "4", Vehicle.Size = "Midsize", highway.MPG = 25, city.mpg = 20, ReleasedYearsAgo = 0)
log_pred_MSRP_bmw = predict(chosen_model_drop1, newdata_bmw, interval="predict", level = 0.9)
exp(log_pred_MSRP_bmw)
## fit lwr upr
## 1 43994.31 36524.49 52991.81
A 90% prediction interval for the MSRP of such a BMW is (36524.49, 52991.81).
Now, let’s say that Honda wishes to release a car with the exact same specifications:
newdata_honda = data.frame(Make="Honda", Engine.Fuel.Type="gasoline", Engine.HP = 220, Transmission.Type = "AUTOMATIC", Driven_Wheels = "rear wheel drive" , Number.of.Doors = "4", Vehicle.Size = "Midsize", highway.MPG = 25, city.mpg = 20, ReleasedYearsAgo = 0)
log_pred_MSRP_hon = predict(chosen_model_drop1, newdata_honda, interval="predict", level = 0.9)
exp(log_pred_MSRP_hon)
## fit lwr upr
## 1 34831.5 28943.97 41916.61
A 90% prediction interval for the MSRP of such a Honda is only (28943.97, 41916.61)!
Based on our model, a Honda is priced much lower than a BMW even though both have the exact same specifications. This may not be surprising, and that is actually good: it is a common belief that similarly "specced" cars cost more when bought from premium brands, and our model says so too!
Let’s say that a car buyer is in the market for a Honda with roughly these specifications, and they see a deal on a model meeting them for $29,999. Since that price sits near the lower bound of our 90% prediction interval, they can say with some confidence that it is a good deal! Such predictions can help consumers make well-informed decisions, which is one of the potential use cases of our project.
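This deal check can be expressed directly in code. A small sketch, reusing `log_pred_MSRP_hon` from the Honda chunk above (the name `deal_price` is our own, introduced for illustration):

```r
# Compare an advertised price against the model's 90% prediction interval
deal_price = 29999
interval   = exp(log_pred_MSRP_hon)   # back to the dollar scale
deal_price < interval[, "fit"]        # below the model's point estimate?
deal_price > interval[, "lwr"]        # still within the plausible range?
```

If the price is below the fitted value but above the lower bound, the model suggests the deal is favorable without being suspiciously cheap.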
This model restricts itself to gasoline and diesel vehicles. At a time when electric cars are all the rage, given the right data and a longer time frame, we could consider building a model for electric cars as well.
There are many other car features, knowledge of which can potentially help make better predictions. Examples of such predictors include engine torque, interior materials, safety ratings, engine aspiration (turbocharged, supercharged, natural), seating capacity, and many others.
There is an even bigger potential in the used-car market. There, factors like age, mileage, crash history, owner count, and exterior and interior condition come into additional consideration. That is a very interesting direction in which such a project could be taken!
Team Members
A big thanks to the Professor and the course staff for a great semester in STAT-420 from our team!